ViralMSA: massively scalable reference
Abstract
Motivation
In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment (MSA) scale poorly with respect to the number of sequences.

Results
ViralMSA is a user-friendly reference-guided MSA tool that leverages the algorithmic techniques of read mappers to enable the MSA of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. However, alignments produced by ViralMSA omit insertions with respect to the reference genome.

Availability and implementation
ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project.

Supplementary information
Supplementary data are available at Bioinformatics online.

1 Introduction
Real-time or near real-time surveillance of the spread of a pathogen can provide actionable information for public health response (Poon et al., 2016). Though there is currently no consensus in the world of molecular epidemiology regarding a formal definition of what exactly constitutes a ‘transmission cluster’ (Novitsky et al., 2017), all current methods of inferring transmission clusters require a multiple sequence alignment (MSA) of the viral genomes: distance-based methods of transmission clustering require knowledge of homology for accurate distance measurement (Pond et al., 2018), and phylogenetic methods of transmission clustering require the MSA as a precursor to phylogenetic inference (Balaban et al., 2019; Prosperi et al., 2011; Ragonnet-Cronin et al., 2013). The standard tools for performing MSA, such as MAFFT (Katoh and Standley, 2013), MUSCLE (Edgar, 2004), and Clustal Omega (Sievers and Higgins, 2014), are prohibitively slow for real-time pathogen surveillance as the number of viral genomes grows.
For example, during the COVID-19 pandemic, the number of viral genome assemblies available from around the world grew exponentially in the initial months of the pandemic, but MAFFT, the fastest of the aforementioned MSA tools, scales quadratically with respect to the number of sequences. In the case of closely related viral sequences for which a high-confidence reference genome exists, MSA can be accelerated by independently comparing each viral genome in the dataset against the reference genome and then using the reference as an anchor to merge the individual alignments into a single MSA (Pond et al., 2018). Here, we introduce ViralMSA, a user-friendly open-source MSA tool that utilizes read mappers such as Minimap2 (Li, 2018) to enable the reference-guided alignment of ultra-large viral whole-genome datasets.

2 Related work
VIRULIGN is another reference-guided MSA tool designed for viruses (Libin et al., 2019). While VIRULIGN also aims to support MSA of large sequence datasets, its primary objective is to produce codon-correct MSAs (i.e. avoiding frameshifts), making it appropriate for aligning coding regions, whereas ViralMSA’s primary objective is to align whole viral genomes in real time. Further, ViralMSA is orders of magnitude faster than VIRULIGN (Fig. 1) and uses a fraction of the memory.

Fig. 1. Execution time. Execution time for SARS-CoV-2 MSAs (genome length 29 kb) estimated by VIRULIGN, MAFFT, and ViralMSA for various dataset sizes. All runs were executed sequentially on an 8-core 2.0 GHz Intel Xeon CPU with 30 GB of memory.

3 Results and discussion
ViralMSA is written in Python 3 and is thus cross-platform. ViralMSA depends on BioPython (Cock et al., 2009) and whichever read mapper the user chooses, which is Minimap2 by default (Li, 2018).
In addition to Minimap2, ViralMSA supports STAR (Dobin et al., 2013), Bowtie 2 (Langmead and Salzberg, 2012) and HISAT2 (Kim et al., 2019), though the default of Minimap2 is strongly recommended: Minimap2 is much faster than the others (Li, 2018) and is the only mapper that consistently succeeds in aligning all genome assemblies against an appropriate reference across multiple viruses. ViralMSA supports read mappers other than Minimap2 primarily to demonstrate that it is flexible, meaning it will be simple to incorporate new read mappers in the future.

ViralMSA takes the following as input: (i) a FASTA file containing the viral genomes to align, (ii) the GenBank accession number of the reference genome, and (iii) the mapper to utilize (Minimap2 by default). ViralMSA will pull the reference genome from GenBank and generate an index using the selected mapper, both of which will be cached for future alignments of the same viral strain, and will then execute the mapping. ViralMSA will then process the results and output an MSA in the FASTA format. For commonly studied viruses, the user can simply provide the name of the virus instead of an accession number, and ViralMSA will select an appropriate reference genome. The user can also choose to provide a local FASTA file containing a reference genome, which may be useful if the desired reference does not exist on GenBank or if the user wishes to conduct the analysis offline.

Because it uses the positions of the reference genome as anchors with which to merge the individual pairwise alignments, ViralMSA only keeps matches, mismatches, and deletions with respect to the reference genome: it discards all insertions with respect to the reference genome. For closely related viral strains, insertions with respect to the reference genome are typically unique and thus lack usable phylogenetic or transmission clustering information, so their removal has little to no impact on downstream analyses (Table 1).
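The reference-anchored merge described above can be illustrated with a minimal sketch. It is not ViralMSA's actual code: the function names `project_onto_reference` and `merge_alignments` and the toy sequences are hypothetical, and the sketch assumes each pairwise alignment is available as a SAM-style CIGAR string together with a 0-based start position on the reference (SAM POS minus 1). Every query is projected onto the reference coordinates, so all rows have the reference length; deletions become gaps and insertions are dropped.

```python
import re

# One CIGAR operation: a run length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def project_onto_reference(query, cigar, ref_start, ref_length, gap="-"):
    """Return a row of exactly ref_length characters for one pairwise alignment.

    query      -- aligned query sequence (the SAM SEQ field)
    cigar      -- SAM-style CIGAR string for query vs. reference
    ref_start  -- 0-based reference position where the alignment begins
    ref_length -- length of the reference genome
    """
    row = [gap] * ref_length   # positions never covered stay as gaps
    q = 0                      # next unread position in the query
    r = ref_start              # next reference column to fill
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":        # match/mismatch: copy query bases into reference columns
            row[r:r + n] = query[q:q + n]
            q += n
            r += n
        elif op in "DN":       # deletion w.r.t. reference: leave gap characters
            r += n
        elif op in "IS":       # insertion or soft clip: skip query bases, no column exists
            q += n
        # H and P consume neither query nor reference
    return "".join(row)


def merge_alignments(records, ref_length):
    """Merge independent pairwise alignments into one MSA keyed by sequence name.

    records -- iterable of (name, query, cigar, ref_start) tuples (hypothetical layout)
    """
    return {
        name: project_onto_reference(query, cigar, start, ref_length)
        for name, query, cigar, start in records
    }


# Toy data (made up for illustration): a 10-base reference and four queries.
reference = "ACGTACGTAC"
records = [
    ("identical", "ACGTACGTAC", "10M", 0),
    ("insertion", "ACGTGGACGTAC", "4M2I6M", 0),  # "GG" inserted after position 4
    ("deletion", "ACGCGTAC", "3M2D5M", 0),       # two reference bases missing
    ("partial", "GTAC", "4M", 2),                # covers only columns 2-5
]
msa = merge_alignments(records, len(reference))
for name, row in msa.items():
    print(f">{name}\n{row}")
```

Because each row is computed from its own pairwise alignment alone, the work grows linearly with the number of sequences, and the rows could be produced in any order or in parallel. The cost of this design is visible in the `insertion` record: its two extra bases have no reference column and do not appear in the output.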
Table 1. MSA accuracy

Virus    MAFFT (S)   ViralMSA (S)   MAFFT (P)   ViralMSA (P)
Ebola    0.9957      0.9873         0.9998      0.9816
HCV      0.9995      0.9506         0.9999      0.9678
HIV-1    0.9786      0.9705         0.9957      0.9941

Correlation coefficients are shown for Mantel tests between curated ‘ground truth’ MSAs and those estimated by MAFFT and ViralMSA. S and P denote Spearman and Pearson correlation, respectively. 1 indicates perfect correlation, −1 indicates perfect anticorrelation, and 0 indicates no correlation.

In order to assess MSA estimation accuracy, we obtained curated Ebola, HCV, and HIV-1 full-genome MSAs from the Los Alamos National Laboratory (LANL) sequence databases, which we used as our ground truth. In order to benchmark MSA runtime, we obtained a large collection of SARS-CoV-2 complete genomes from the Global Initiative on Sharing All Influenza Data (GISAID) database. VIRULIGN crashed when run on all datasets aside from the SARS-CoV-2 dataset. To measure performance, we subsampled the full SARS-CoV-2 dataset, with 10 replicates for each dataset size, and then computed MSAs of each replicate. ViralMSA is consistently orders of magnitude faster than both MAFFT and VIRULIGN (Fig. 1 and Supplementary Fig. S1). Further, for all SARS-CoV-2 datasets, both ViralMSA and MAFFT required […]

Dobin A. et al. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29, 15–21.
Edgar R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinform., 5, 113.
Katoh K., Standley D.M. (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol., 30, 772–780.
Kim D. et al. (2019) Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol., 37, 907–915.
Langmead B., Salzberg S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.
Li H. (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34, 3094–3100.
Libin P.J.K. et al. (2019) VIRULIGN: fast codon-correct alignment and annotation of viral genomes. Bioinformatics, 35, 1763–1765.
Novitsky V. et al. (2017) Phylogenetic inference of HIV transmission clusters. Infect. Dis. Transl. Med., 3, 51–59.
Piñeiro C. et al. (2020) VeryFastTree: speeding up the estimation of phylogenies for large alignments through parallelization and vectorization strategies. Bioinformatics, 36, 4658–4659.
Pond S.L.K. et al. (2018) HIV-TRACE (TRAnsmission Cluster Engine): a tool for large scale molecular epidemiology of HIV-1 and other rapidly evolving pathogens. Mol. Biol. Evol., 35, 1812–1819.
Poon A.F.Y. et al. (2016) Near real-time monitoring of HIV transmission hotspots from routine HIV genotyping: an implementation case study. Lancet HIV, 3, e231–e238.
Prosperi M.C.F. et al. (2011) A novel methodology for large-scale phylogeny partition. Nat. Commun., 2, 321.
Ragonnet-Cronin M. et al. (2013) Automated analysis of phylogenetic clusters. BMC Bioinform., 14, 317.
Robinson D.F., Foulds L.R. (1981) Comparison of phylogenetic trees. Math. Biosci., 53, 131–147.
Sievers F., Higgins D.G. (2014) Clustal Omega, accurate alignment of very large numbers of sequences. Methods Mol. Biol., 1079, 105–116.
Tamura K., Nei M. (1993) Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol., 10, 512–526.
Tavaré S. (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures Math. Life Sci., 17, 57–86.

© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected]
This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)